Weather Model Methods and Results

Conor Olive

1 Methods

1.1 Observations

Our observations are obtained from Castro (2023) and contain data from U.S. weather stations and SNOTEL stations for the period from 1 January 2017 to 21 September 2017. The columns of the observations are:

Name           Data Type  Description
station        string     Name of the weather station
state          string     State/province of the weather station
latitude       double     Latitude of the weather station (degrees)
longitude      double     Longitude of the weather station (degrees)
elevation      double     Elevation of the weather station (meters)
date           string     Date of the observation
temp_min       double     Minimum daily temperature (Fahrenheit)
temp_max       double     Maximum daily temperature (Fahrenheit)
temp_avg       double     Average daily temperature (Fahrenheit)
av_day_wi_spd  double     Average daily wind speed
wi_dir_5_sec   double     Wind direction at 5-second intervals (degrees)
wi_spd_5_sec   double     Wind speed at 5-second intervals
snow_fall      double     Snow water equivalence
snow_dep       double     Snow depth
precip         double     Precipitation

Our interest is in creating a model from the available data using

  • latitude
  • longitude
  • elevation
  • date

as predictors of temp_avg. To prepare the data for modeling, we first load the data.

weather <- read.csv("./data/weather.csv")

Next, we remove the columns we are not interested in from the data frame.

# Load the tidyverse, which provides the pipe and tidyr::drop_na()
library(tidyverse)

# Keep only our x and y columns from the data table
weather <- weather[c("date", "longitude", "latitude", "elevation", "temp_avg")]

# Drop observations with NA values for our x and y columns
weather <- weather %>% drop_na()
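
For readers not using the tidyverse, the same column selection and NA filtering can be done in base R with na.omit(). A minimal sketch on a hypothetical toy data frame:

```r
# Toy data frame standing in for the weather data (hypothetical values)
toy <- data.frame(
  date = c("20170101", "20170102", "20170103"),
  longitude = c(-120.1, -110.5, -95.2),
  latitude = c(37.4, 44.0, NA),
  elevation = c(1200, 2500, 300),
  temp_avg = c(35.2, NA, 41.0),
  station = c("A", "B", "C")  # an extra column to be dropped
)

# Base-R equivalent of the column selection and drop_na() above
cols <- c("date", "longitude", "latitude", "elevation", "temp_avg")
toy <- na.omit(toy[cols])

nrow(toy)  # only the one row with complete x and y values remains
```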

Since the date is not useful as a string for fitting a model, we replace it with the number of days since our first observation on 1 January 2017.

# Convert "date" column from string to date objects
weather$date <- as.Date(as.character(weather$date), format="%Y%m%d")

# Create new column counting the days since Jan 1 2017
weather$day <- as.numeric(difftime(weather$date, as.Date("20170101", format="%Y%m%d"), units="days"))

# Remove date column
weather <- weather[c("day", "longitude", "latitude", "elevation", "temp_avg")]
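
As a quick sanity check of this conversion, we can verify the day count for a few example dates by hand:

```r
# Spot-check the date-to-day conversion on a few example dates
dates <- as.Date(c("20170101", "20170201", "20170921"), format = "%Y%m%d")
days  <- as.numeric(difftime(dates, as.Date("20170101", format = "%Y%m%d"),
                             units = "days"))
days  # 1 January is day 0, 1 February is day 31, 21 September is day 263
```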

Finally, we restrict our data to the lower 48 U.S. states and the District of Columbia. To accomplish this, we must:

  • Convert our data.frame into a geospatial sf data frame object
  • Load shapefiles for the area of interest
  • Exclude observations spatially located outside of the shapefile
  • Convert the sf back into a data.frame object

The process for this follows below.

# Load the sf package for spatial operations
library(sf)

# Convert our data frame to sf
weather_sf <- st_as_sf(weather, coords = c("longitude", "latitude"), crs=st_crs("EPSG:4269"))

# Load U.S. States shapefile
states_sf <- read_sf("./data/cb_2018_us_state_500k/cb_2018_us_state_500k.shp", crs=st_crs("EPSG:4269"))

# Exclude specific states and territories
states_sf <- subset(states_sf, !(STATEFP %in% c("02","15","60","66","69","72","78")))

# Remove observations outside the remaining states and territories
weather_sf <- st_intersection(weather_sf, st_union(states_sf))

# Convert back into data frame
weather <- st_drop_geometry(weather_sf)
weather$longitude <- st_coordinates(weather_sf)[,"X"]
weather$latitude <- st_coordinates(weather_sf)[,"Y"]

We now have data to which we may properly fit our model.

Before continuing, it is worth noting the spatial distribution of the observations in Figure 1. Specifically, the density of observations is much higher in the inter-mountain west, which has implications for our model; these will be discussed in more detail later.

Figure 1: Spatial distribution of the observations within the data set.

We may plot the data two dimensions at a time to see what kind of trends, if any, exist. Looking at Figure 2 (b), there is a clear negative linear correlation between elevation and temp_avg. Likewise, there is a clear negative linear correlation between latitude and temp_avg. The two remaining predictors, longitude and day, also appear to show some kind of trend, but it is not clear that those trends are linear.

Based on a physical understanding of the weather, the correlations with both latitude and elevation are intuitive. First, atmospheric pressure decreases with elevation. In turn, Gay-Lussac’s law informs us that, at constant volume, \[P \propto T.\] Therefore, we transitively expect temperature to decrease with elevation, all else being equal.

In addition, background knowledge about solar irradiance informs us that irradiance, the primary forcing of surface temperature, decreases in intensity with distance from the solar equator, due to the angle of incidence at which solar rays strike the Earth.

Demonstration of the relationship between the angle of incidence and the intensity of solar irradiance.

Ignoring the effect of the atmosphere on solar irradiance, geometric reasoning tells us that the solar intensity \(I\) at a given latitude \(\theta\), where \(I_0\) is the intensity at the equator, is \[ I = I_0 \cos{\theta}.\] However, in our data, where latitude varies from approximately 25 to 50 degrees, this function is roughly linear, as shown in Figure 3.
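
This near-linearity can be checked numerically by fitting a straight line to the cosine curve over the observed latitude range (a small sketch, with \(I_0\) normalized to 1):

```r
# Theoretical irradiance I = I0 * cos(theta) with I0 = 1
theta <- seq(25, 50, by = 0.5)   # latitude in degrees
I <- cos(theta * pi / 180)       # cos() expects radians

# Fit a straight line to the curve and inspect the goodness of fit
fit <- lm(I ~ theta)
summary(fit)$r.squared           # very close to 1 over this range
```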

Figure 3: Theoretical irradiance versus latitude in the lower 48 states.

In addition, if we interpolate the data, we may view how the data change in space and over time, though the effect of elevation is not visualized. The inverse distance weighted interpolation is computationally intensive, so patience is required to generate the resulting visualization in Figure 4.

Figure 4: Inverse distance weighted interpolation of temp_avg from each station for each day in the data set.

1.2 Procedure

Because the underlying physical system combines both linear and highly non-linear effects, we choose to fit the data set with a Generalized Additive Model (GAM), implemented by Hastie (2023). In this type of model, we explain our outcome variable \(\hat{y}\) as the sum of smooth functions of our predictors \(x_i\), i.e.

\[\hat{y} = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_n(x_n).\] For our particular predictors, we use a simple linear function \(f_i(x_i) = \beta_i x_i\) for elevation, latitude, and longitude, and a nonlinear smooth function for day. Our particular model is then

\[\text{temp\_avg} = \beta_0 + \beta_1 (\text{elevation}) + \beta_2 (\text{latitude}) + \beta_3 (\text{longitude}) + f(\text{day}),\] since we hypothesize the partial derivatives of temp_avg with respect to longitude, latitude, and elevation to be roughly constant.

Before we perform the regression, we first split the data in weather randomly into training and testing sets with a 70/30 split.

# Randomly choose our indices to sample from for our training set
sample_frac <- 0.7
train_inds <- sample(c(TRUE, FALSE), nrow(weather), replace=TRUE, 
                     prob=c(sample_frac,1-sample_frac))

# Create training set from training set indices
train <- weather[train_inds,]
train_x <- train[, c('elevation','longitude','latitude','day')]
train_y <- train[, c('temp_avg')]

# Create testing set from negation of training set indices
test <- weather[!train_inds,]
test_x <- test[, c('elevation','longitude','latitude','day')]
test_y <- test[, c('temp_avg')]
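
Note that sampling TRUE/FALSE independently per row yields a 70/30 split only in expectation, so the realized proportion varies from run to run. A small sketch (using a seed, which the analysis above does not fix) illustrates this:

```r
# Reproducible illustration of the per-row 70/30 sampling
set.seed(42)  # hypothetical seed, chosen only for this sketch
n <- 10000
sample_frac <- 0.7
train_inds <- sample(c(TRUE, FALSE), n, replace = TRUE,
                     prob = c(sample_frac, 1 - sample_frac))

mean(train_inds)  # close to, but generally not exactly, 0.7
```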

Now we fit our model.

library("gam")

gam_weather <- gam(temp_avg ~ elevation + longitude + latitude + s(day,df=5), data=train)

2 Results

The resulting coefficients \(\beta_i\) in our gam_weather model are:

Using our testing set, we calculate the root mean square deviation (RMSE).

# Use the model to make predictions on our testing set
gam_predict <- predict(gam_weather, test_x)

# Calculate RMSE from the residuals between our predictions and the testing set temp_avg
gam_rmse <- sqrt(mean((gam_predict - test_y)^2))

giving us a root mean square deviation of 7.86 °F. Partial residual plots for each of the predictors are shown below in Figure 5.

Figure 5: Partial residuals and components of the model.

In addition, we may also show the residuals per station over time.

Figure 6: Temperature residuals by station over time.

3 Discussion

3.1 Limitations

3.1.1 Limitations of the Data

Several limitations exist on what may and may not be concluded from this data and analysis. For one, the data are limited to part of the year 2017 and geographically constrained to the lower 48 states of the United States. As a result, the observed trends cannot be extrapolated beyond this region and year without employing outside knowledge from the climate sciences.

With this caveat in mind, the data do corroborate certain known phenomena we hypothesized earlier would be shown; in particular that

  • Temperature decreases in proportion to elevation
  • Temperature decreases in proportion to the absolute difference in latitude from the solar equator

In addition, linear modeling of the trend across longitude suggested that, across the lower 48 states, temperature decreased in proportion to longitude. In general, this trend is driven by an array of confounding atmospheric and climatological phenomena, in particular

  • Jet streams
  • North American topography
  • Ocean currents

for which data is not included in the study, and thus these causes cannot be identified without analysis of additional sets of data.

Lastly, there are some limitations in the data resulting from the spatial distribution of our observations, as shown in Figure 1. Because of the over-representation of observations in the inter-mountain west, our error is greatest in the eastern U.S., where stations are lower in density. This becomes particularly evident around day 150, as shown in Figure 6, where most of the residuals in the west are small, but the model predicts colder weather than is actually observed in most of the eastern U.S.

3.1.2 Limitations of the Model

Weather systems and atmospheric dynamics are best explained through fluid dynamics and simulation with global circulation models (GCMs). Fitting any linear or nonlinear regression model will inherently yield a model that cannot be generalized outside the observed time period or region. However, our model remains useful as a tool for observing intra-year trends across the geographical region and time period observed.

4 Conclusions

In conclusion, our study aimed to model average daily temperature using a combination of geographical and temporal predictors. We utilized observational data collected from U.S. weather stations and SNOTEL stations, covering the period from January 1, 2017, to September 21, 2017. The predictors considered in our model were latitude, longitude, elevation, and date. Fitting a Generalized Additive Model to the data yielded acceptable results, with an RMSE of 7.86 °F, and captured several trends consistent with theory, namely

  • a negative correlation between latitude and temperature
  • a negative correlation between altitude and temperature

as well as a slight linear correlation between longitude and temperature and a nonlinear trend in time, resulting from complex nonlinear interactions between jet streams, ocean currents, and the rugged topography of the inter-mountain west. The resulting model is a reasonably good predictor of temperature within the timeframe of the data, especially within the inter-mountain west, but poorer outside this region. For higher accuracy, a full global circulation model would perform better, but the model presented here is many orders of magnitude simpler to compute, more interpretable, and still reasonably accurate.

5 References

Castro, Spencer. 2023. “Weather Data.” https://catcourses.ucmerced.edu/files/6731022/download?download_frd=1.
Gräler, Benedikt, Edzer Pebesma, and Gerard Heuvelink. 2016. “Spatio-Temporal Interpolation Using Gstat.” The R Journal 8: 204–18. https://journal.r-project.org/archive/2016/RJ-2016-014/index.html.
Hastie, Trevor. 2023. Gam: Generalized Additive Models.
Hijmans, Robert J. 2023. Raster: Geographic Data Analysis and Modeling. https://rspatial.org/raster.
Pebesma, Edzer. 2018. “Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal 10 (1): 439–46. https://doi.org/10.32614/RJ-2018-009.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

6 Appendix

6.1 Interpolation Code

library(tidyverse)
library(sf)
library(gstat)
library(raster)
library(gganimate)
library(av)

# Get the first and last day
day_min <- min(weather_sf$day)
day_max <- max(weather_sf$day)
bbox <- st_bbox(st_union(states_sf))

# Create a list of rasters
rasters <- list()

for (d in day_min:day_max) {
  # Choose weather data from a single day
  daily <- subset(weather_sf, day == d)
  date <- as.Date("20170101", format="%Y%m%d") + d
  
  # Interpolate the weather station data
  gs <- gstat::gstat(formula=temp_avg~1, data=daily, nmax=5, set=list(idp = 0))  # idp = 0 weights the 5 nearest stations equally
  grid <- raster::raster(ncol=200, nrow=200, xmn=bbox$xmin, xmx=bbox$xmax, ymn=bbox$ymin, ymx=bbox$ymax, crs="EPSG:4269")
  
  # Create a raster from the interpolated data
  rs <- raster::interpolate(grid, model=gs)
  rsm <- raster::mask(rs, as_Spatial(st_union(states_sf)))
  
  # Save the result for the day
  file <- writeRaster(rsm, filename = paste('./avgdaily/', d, ".tif", sep=""), format="GTiff", overwrite=TRUE)
}

# Create a list of the rasters and form a raster brick
rasters <- list()
for (d in day_min:day_max) { rasters[[length(rasters)+1]] <- raster(paste("./avgdaily/", d, ".tif", sep=""))}
br <- raster::brick(rasters)

# Create a data frame from the raster brick
rsm_df <- purrr::map_dfr(
  as.list(br),
  ~setNames(as.data.frame(as(., "SpatialPixelsDataFrame")), c('temp_avg', 'x', 'y')),
  .id = 'day'
)
rsm_df$day <- as.numeric(rsm_df$day)

# Compute color scale limits from the observations
temp_avg_min <- min(weather_sf$temp_avg)
temp_avg_max <- max(weather_sf$temp_avg)

# Plot the data
gga <- ggplot() + 
  geom_tile(data=rsm_df, mapping=aes(x=x,y=y,fill=temp_avg)) +
  scale_fill_viridis_b(name="Temperature (F)", na.value = "transparent", limits=c(temp_avg_min, temp_avg_max)) +
  labs(x = "Longitude", y = "Latitude") +
  geom_sf(data=states_sf, alpha=0) +
  theme_minimal() +
  gganimate::transition_time(as.numeric(day)) +
  ggtitle("Average Daily Temperature (Day {frame-1})")

gganim <- gganimate::animate(gga, renderer=av_renderer(), nframes=day_max, height=1080, width=1920, units="px", res=300)
anim_save("./avgdaily.mp4", gganim)